INTERPRETING MULTIPLE LOGISTIC REGRESSION
COEFFICIENTS IN PROSPECTIVE OBSERVATIONAL STUDIES


Summary 

Multiple  logistic  models  are  frequently  used  in  observational  studies  to 
assess the contribution of risk factors to disease. In the presence of correlation
among risk factors, the estimated magnitude of a multiple logistic coefficient
can become uncertain or meaningless. This paper highlights the problem
of  interpreting  a  multiple  logistic  coefficient  and  suggests  a  procedure  for 
examining  the  total  contribution  of  a  risk  factor  to  disease  that  includes  a 
direct  association  and  associations  that  exist  through  relationships  with  other 
antecedent  characteristics.  Examples  are  given,  along  with  results  that  are  not 
immediately  obvious  when  considering  the  multiple  logistic  coefficient  alone. 
Conclusions  that  are  presented  are  important  in  biological  studies  if  isolating 
the  effect  of  an  antecedent  characteristic  is  unreasonable  in  the  presence  of 
confounding  influences. 


Running  head:  Interpreting  Multiple  Logistic  Regression  Coefficients 

Keywords: multiple logistic regression, prospective observational studies,
correlation, projected slope.
Introduction


The multiple logistic model is a common statistical tool used to analyze
data from prospective observational studies when the endpoint is a dichotomous
variable (1,2). For example, in the Framingham Heart Study (3), the endpoint
of interest is often whether or not an individual develops coronary heart
disease (CHD). If one is interested in the effect of triglyceride (TG) on the
probability of developing CHD, the first step might be to model this effect by
a univariate logistic analysis:

$$\operatorname{logit}[p = \text{probability of CHD}] = \log[p/(1-p)] = \beta_0 + \beta_1(\mathrm{TG})$$

As reported in more detail later in this presentation, for Framingham males an
estimate of $\beta_1$ is 0.437 with p<0.05, indicating that TG is a significant
univariate predictor of CHD. One can now easily estimate the probability of
developing CHD given an individual's TG value. Furthermore, given two different
values of TG, $\mathrm{TG}_1$ and $\mathrm{TG}_2$, one can also compute the odds ratio of
developing CHD based on the value of $\mathrm{TG}_1$ relative to the value of $\mathrm{TG}_2$.
That is,

$$\text{odds ratio} = \exp[\beta_1(\mathrm{TG}_1 - \mathrm{TG}_2)]$$

Note that when $\beta_1$ is significant, the odds ratio for any two distinct TG
values will be significantly different from one.
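
As a small illustration of the odds ratio formula, the following Python sketch evaluates it for the reported univariate estimate of 0.437. The function name `odds_ratio` and the two TG values are hypothetical, chosen only to show a one-unit difference; the units in which TG entered the model are not stated here.

```python
import math

def odds_ratio(beta1, tg_1, tg_2):
    """Odds ratio of CHD for TG value tg_1 relative to tg_2 under a univariate logistic model."""
    return math.exp(beta1 * (tg_1 - tg_2))

beta1_hat = 0.437  # univariate estimate reported for Framingham males

# Hypothetical TG values one unit apart (units as used when fitting the model).
ratio = odds_ratio(beta1_hat, 2.0, 1.0)
print(f"odds ratio for a one-unit TG difference: {ratio:.3f}")

# Identical TG values always give an odds ratio of exactly one.
print(odds_ratio(beta1_hat, 1.5, 1.5))
```

A one-unit difference gives an odds ratio of exp(0.437), or roughly 1.55, and the ratio depends only on the difference between the two TG values.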

At this point, most investigators would then consider a more complete analysis,
attempting to uncover the relationship between CHD and TG controlling for
covariables such as high density lipoprotein cholesterol (HDL-C), total
cholesterol (T-C), and Metropolitan relative weight (MRW). The investigator would
then fit the logistic model

$$\operatorname{logit}[p = \text{probability of CHD}] = \beta_0 + \beta_1(\mathrm{TG}) + \beta_2(\text{HDL-C}) + \beta_3(\text{T-C}) + \beta_4(\mathrm{MRW})$$


For the Framingham males, the coefficient for TG, $\beta_1$, is -0.183, which is
not statistically significant (p>.10). On the other hand, the coefficient for
HDL-C, $\beta_2$, is -0.048, which is statistically significant (p<.05). The
coefficient for T-C, $\beta_3$, is 0.005, which is significant at the 0.10 level,
and the coefficient for MRW, $\beta_4$, is 0.002, which is not significant.

Having performed the above analysis, it is quite natural for the investigator
to conclude that for Framingham males, while TG is a significant univariate
predictor of CHD, most of its predictive ability can be explained through HDL-C,
T-C and MRW. This is often phrased as something like, "TG is not a significant
independent predictor of CHD." The usual implied set of conclusions then
follows:

a.  Most of the effects of TG on CHD are explainable by HDL-C and, to a lesser
degree, the other covariates.

b.  TG  is  an  unimportant  variable  in  the  study  of  atherogenesis. 

c.  Altering  TG  to  reduce  CHD  risk  may  be  ineffective. 

One purpose of this article is to show that in prospective observational
studies, the three conclusions outlined above can result in a misleading
understanding of the relationship of TG to CHD. This dilemma is often encountered
and discussed in terms of confounding or multicollinearity (4-7). Our aim
in this presentation is to introduce a different perspective which will be
of use to epidemiologists in explaining the consequence of these misleading
conclusions and what important information can be salvaged. The problem, of
course, is that in prospective observational studies, predictor variables such as
TG and HDL-C are likely to be highly correlated. For Framingham males the
correlation is -0.451 (p<0.05). This means that for a given level of HDL-C, the
variation of TG may be small, so that for a fixed value of HDL-C one should not
expect TG to show much of a relationship with CHD, because information on TG is
insufficient. As a result, the evidence is not available to support the three
conclusions given above. In this instance and in other examples that will be
presented, it will be shown that from prospective observational studies it is
often difficult to investigate the three conclusions if the studies are not
specifically designed to do so. If, for example, all levels of TG could be
cross classified with all levels of HDL-C, then such conclusions would be
possible to consider. Cross classification of this type, however, is usually a
goal of controlled clinical trials and is not commonly experienced in
observational studies.

The  second  purpose  of  this  article  is  to  provide  a  simple  method  for  better 
explaining  the  association  of  TG  with  CHD.  We  will  define  a  statistic  called 
the projected slope which measures the effect of changing TG levels on the
probability of developing CHD, while at the same time considering the effect of the
other  covariates  on  CHD  and  the  relationship  between  TG  and  these  covariates. 

The  projected  slope  is  not  new  and  has  appeared  elsewhere  (8).  It  has  been  shown 
to  be  a  useful  statistic  based  on  the  same  ideas  used  in  path  analysis  (9)  for 
linear  regression,  and  can  be  easily  used  in  many  analyses.  We  also  provide 
some  additional  examples  from  the  Framingham  data  on  the  use  of  the  projected 
slope  involving  sets  of  risk  factors  for  predicting  CHD  other  than  those  already 
introduced. 

Along  with  the  purposes  of  the  paper  described  above,  we  acknowledge  that 
the  limitations  of  multiple  logistic  regression  mirror  those  that  are  exhibited 
in  the  usual  least  squares  regression  situation.  The  multiple  logistic  model 
receives  special  emphasis  in  this  paper,  not  because  it  is  characterized  by  any 
unique  statistical  feature  used  in  estimating  parameters,  but  because  it  has 
appeared  in  so  many  investigations  linking  risk  factors  to  disease  (10-15). 
Furthermore, its recent introduction into widely used statistical packages
(16,17) has encouraged its use and the attendant need for the cautions and
caveats that are given here.

Interpreting  the  Meaning  of  a  Multiple  Logistic  Coefficient 

Suppose there is interest in estimating a logistic expression for the
probability of developing disease conditional on knowing two characteristics. For
every individual examined, a bivariate observation of antecedent characteristics
can be plotted as in Figure 1 in the $x_1 x_2$ plane. As an example, $x_1$ might be
TG, while $x_2$ might be HDL-C. The point $(x_{1i}, x_{2i})$ represents the
observations for the ith person, i.e., $(\mathrm{TG}_i, \text{HDL-C}_i)$.

These  data  points  are  observed  at  the  beginning  of  a  study,  and  when  the 
study  has  terminated  a  tally  of  healthy  and  diseased  individuals  is  made.  All 
of the data are used to provide estimates of coefficients for a logistic equation.
The resulting equation describes a response surface represented in Figure
1  as  a  plane.  The  height  of  the  response  surface  reflects  the  estimated  risk  of 
disease  for  individuals  who  possess  characteristics  directly  below  the  plane. 

For the ith person with characteristics $(x_{1i}, x_{2i})$, the probability of
developing disease can be estimated from the logistic equation and is represented
by the height of the arrow (the height falling somewhere between zero and one).

Suppose that the multivariate logistic regression coefficient associated with
$x_1$ (TG) is zero, while that associated with $x_2$ (HDL-C) is negative. Thus, $x_2$
is said to be inversely associated with disease while there is no association
between disease and $x_1$. This interpretation can be easily described
geometrically by considering Figure 1. Note that changes made parallel to the
$x_1$ axis, when $x_2$ is held fixed, do not affect the chance of developing disease,
corresponding to a coefficient of zero that is associated with $x_1$. In contrast,
changes made parallel to the $x_2$ axis do affect the chance of developing disease.
In fact, for a given value of $x_1$, increases in $x_2$ will reduce the height of the
response surface and reduce the estimated chance of developing disease,
consistent with the inverse association between $x_2$ and disease that is implied
by the corresponding negative coefficient associated with $x_2$.

The scatter of points in Figure 1, in which $x_1$ and $x_2$ are unrelated, might
well be observed in a controlled clinical trial. In this instance, it makes
sense to discuss the impact of holding one characteristic fixed and interpreting
the importance of another characteristic through the multiple logistic
coefficient, because all combinations of characteristics have been observed and are
reasonable to consider. In this instance, unlikely combinations of
characteristics are not being created by holding one characteristic fixed and then
adjusting the other.

The data typical of prospective observational studies rarely result in
predictors $x_1$ and $x_2$ which are unrelated. Correlations between risk factors are
the rule rather than the exception. Such an instance is described in Figure 2.
One can see that the data points represented in the $x_1 x_2$ plane tend to fall
along a line. For example, small values of $x_1$ are related to large values of
$x_2$. In contrast, there are no data points in which both $x_1$ and $x_2$ are near
zero. Clearly, it is not meaningful to discuss the effect of changing $x_1$ while
holding $x_2$ fixed, but it is just this assumption which is at the heart of the
reasoning used to support the three conclusions given in the Introduction; i.e.,
for given levels of $x_2$ we force unobserved differences in $x_1$ that enable us to
imply that $x_1$ is not independently related to disease. We are basing this
decision on the insufficient data that are observed for fixed values of $x_2$.

In Figure 2, unlike the example in Figure 1, the response surface no longer
rests above a sample of all combinations of values of $x_1$ and $x_2$, but behaves
like a teeter-totter resting on a locus of points projected up from the observed
data. Note that the only area of the response surface that has any meaning is
that area directly above the observed data. This is true because the data are
insufficient to suggest that other areas of the surface adequately represent the
chance of developing disease. As in the usual linear regression problem, the
variances of the estimated multiple logistic coefficients are potentially
inflated by the correlation between $x_1$ and $x_2$ and by the instability of the
response surface, which may result in an uncertain indication of the importance
of $x_1$ and $x_2$ in predicting disease.

From Figure 2, two components that relate $x_1$ with disease can be envisaged.
The first is a direct or independent effect or association. The second relates
$x_1$ with disease indirectly via an association with $x_2$ and the association $x_2$
has with disease. Figure 2 illustrates that it is not clear how to interpret
the magnitude of the multiple logistic coefficient associated with the slope of
the lines in the response surface appearing in the same planes as the $x_1$ and
$x_2$ axes, because levels of $x_1$ are related to levels of $x_2$. In such an
instance, assessing the effect on disease of changing $x_1$ is not realistic unless
values of $x_2$ are also changed in a way that is observed in nature. To change
$x_1$ while holding $x_2$ fixed may exceed the limits of the data and may be contrary
to what is possible. This is where interpretation of the multiple logistic
coefficient of $x_1$ becomes misleading, because the independent component
associated with changing $x_1$ alone cannot be realistically separated from the
component represented by the indirect association that exists between $x_1$ and
$x_2$ and the relationship $x_2$ has with disease.

Thus, in the many practical situations in which the predicting characteristics
are highly correlated, interpreting the multiple logistic coefficient by
considering one characteristic held fixed while changing the other may be
unreasonable. One may be artificially producing unlikely combinations of
characteristics and formulating extrapolations that exceed the limitations of the
observed data. We feel that a more useful analysis of the predictive importance
of a variable should not hold constant the level of another variable to which
it is physiologically related, but rather allow the characteristics to vary
simultaneously as they would be expected to biologically. As illustrated in
Figure 2, it would be important to consider not only the multiple logistic
coefficients of $x_1$ and $x_2$, but also the slope of the line connecting the points
P and Q that lies above most of the data that are observed and the regression
line between $x_1$ and $x_2$. Consideration of the slope of the line designated by
P and Q is appealing because it is a function of the relationship between $x_1$
and $x_2$ as well as their relationships with the disease. If, for example, the
characteristic $x_1$ is altered, on the average $x_2$ will also be altered, and the
chance of developing disease will move along the line marked by P and Q. For
lack of a better term, the slope of the line connecting P and Q when written in
the logit scale will be called the projected slope.

The  Projected  Slope 

The preceding discussion has focused on the effects on logistic regression
due to correlation between predictor variables. This is, of course, a special
circumstance of what has been called multicollinearity or confounding, which is
a general issue affecting all nonrandomized studies. It is not our purpose
here to become involved in the controversies surrounding the problem of
confounding. Rather than trying to discuss the independent effect of a predictor
such as TG, we will use the idea of the projected slope to try to see if a
particular variable has any predictive effect on the probability of disease. As
mentioned before, the development is based on the ideas of path analysis, which
is often used in linear regression but not multiple logistic regression.

First, we suspect that there may be a linear relationship as in Figure 2
between $x_1$ and $x_2$. Specifically, we may think that $x_1$ can be used to predict
$x_2$, e.g., TG predicting HDL-C. This is written symbolically as

[1]  $x_2 = \gamma_0 + \gamma_1 x_1 + e$

Conditionally, once we have observed $x_1$ and $x_2$, we hypothesize a multiple
logistic regression model:

[2]  $\operatorname{logit}[\text{probability of CHD}] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Informally, we could substitute the expected value of [1] into [2], obtaining as
an approximation

$$\operatorname{logit}[\text{probability of CHD}] = (\beta_0 + \gamma_0\beta_2) + (\beta_1 + \gamma_1\beta_2)x_1$$

It turns out that $\beta_1 + \gamma_1\beta_2$ is the projected slope associated with $x_1$.
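
The projected slope is simple enough to compute by hand, but a small helper makes the structure explicit. The Python sketch below is an illustration rather than the authors' software: the function name `projected_slope` and the numeric values are hypothetical, and the sum over several covariates anticipates the general form that the text attributes to the Appendix.

```python
def projected_slope(beta_target, gammas, betas_other):
    """Direct coefficient plus the sum of gamma_k * beta_k over the other covariates.

    beta_target : multiple logistic coefficient of the variable of interest
    gammas      : slopes of each other covariate regressed on the variable of interest
    betas_other : multiple logistic coefficients of those covariates, in the same order
    """
    if len(gammas) != len(betas_other):
        raise ValueError("gammas and betas_other must have the same length")
    indirect = sum(g * b for g, b in zip(gammas, betas_other))
    return beta_target + indirect

# Hypothetical two-predictor case: no direct effect of x1 (beta1 = 0),
# x2 falls as x1 rises (gamma1 = -0.5), and x2 is protective (beta2 = -0.8).
slope = projected_slope(0.0, [-0.5], [-0.8])
print(slope)
```

In this hypothetical case the direct coefficient is zero, yet the projected slope is positive because raising x1 lowers a protective x2.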

The projected slope can be derived more formally as follows. If we take
two people exhibiting predictors $(x_{1i}, x_{2i})$ and $(x_{1j}, x_{2j})$ that appear
along the regression line between $x_1$ and $x_2$ in Figure 2, the log odds ratio of
developing disease for these two people has expectation

$$(\beta_1 + \gamma_1\beta_2)(x_{1i} - x_{1j})$$

It is clear that if $\beta_1 + \gamma_1\beta_2 = 0$, then the slope of the line
connecting P and Q shown in Figure 2 (that is projected up from the regression
between $x_1$ and $x_2$) will be zero.
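
The expectation of the log odds ratio can be checked step by step from models [1] and [2]. The derivation below is a sketch that assumes the error term $e$ in [1] has mean zero given $x_1$; the symbol $\mathrm{OR}_{ij}$ for the odds ratio of person i relative to person j is introduced here for convenience.

```latex
\begin{align*}
\log \mathrm{OR}_{ij}
  &= (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) - (\beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j}) \\
  &= \beta_1 (x_{1i} - x_{1j}) + \beta_2 (x_{2i} - x_{2j}), \\[4pt]
E\left[x_{2i} - x_{2j} \mid x_{1i}, x_{1j}\right]
  &= \gamma_1 (x_{1i} - x_{1j}), \\[4pt]
E\left[\log \mathrm{OR}_{ij} \mid x_{1i}, x_{1j}\right]
  &= (\beta_1 + \gamma_1 \beta_2)(x_{1i} - x_{1j}).
\end{align*}
```

The intercepts $\beta_0$ and $\gamma_0$ cancel, so only the direct coefficient and the product $\gamma_1\beta_2$ remain.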

One way to better understand the meaning of the projected slope is through
consideration of Figures 1 and 2. In Figure 1 there is no effect of $x_1$ on the
probability of disease, as we have seen geometrically. Since $x_1$ and $x_2$ are
unrelated in this figure, $\gamma_1 = 0$. Further, the multiple logistic coefficient
for $x_1$ is $\beta_1 = 0$. This means the projected slope is
$\beta_1 + \gamma_1\beta_2 = 0$, indicating no effect on the risk of disease due to
differences in $x_1$. In Figure 2, changing $x_1$ should change $x_2$, which in turn
will change the risk of disease. Here, $\gamma_1 \neq 0$, $\beta_2 \neq 0$, and the
projected slope is $\gamma_1\beta_2 \neq 0$, as expected.

Details of estimating the projected slope from data, as well as its
definition when there are more than two predictors, are provided in the Appendix.

A test of significance of the projected slope is equivalent to testing
$H_0$: $\beta_1 + \gamma_1\beta_2 = 0$. This tests whether or not $x_1$ has any
predictive effect on the risk of disease. A discussion of the mechanics of making
this hypothesis test is also provided in the Appendix.

It can also be shown that in certain situations the estimated projected
slope for a risk factor is asymptotically equal to the univariate logistic
regression coefficient that relates the risk factor to disease (7). The
asymptotic convergence of the estimated projected slope to the univariate
coefficient, however, is not guaranteed. Nevertheless, the consequence is that it
can emphasize the importance of the univariate coefficient. The advantage of
considering the projected slope is that in most situations the variance is smaller
than the variance of the univariate coefficient derived from a simple regression
of disease on the risk factor. Also, we are assuming a multivariate model and it
makes more sense to refer to estimates from such a model. An additional
advantage is that the projected slope provides a descriptive partitioning of the
univariate coefficient into explanatory segments that describe how a risk
factor is related to disease both directly and through relationships with other
covariables. Notice in one of the examples above that the projected slope for
$x_1$ was $\gamma_1\beta_2 \neq 0$. This suggests that the magnitude of the univariate
coefficient is solely attributed to the association of $x_1$ with $x_2$ and the
relationship $x_2$ has with disease.
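
The relationship between the projected slope and the univariate coefficient can be explored by simulation. The sketch below is not from the paper: it assumes NumPy is available, uses simulated data with hypothetical parameter values rather than Framingham data, and fits the logistic models with a small hand-written Newton-Raphson routine (`fit_logistic`, a name introduced here).

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic model; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical parameters: x1 has no direct effect, x2 is protective,
# and x2 decreases as x1 increases.
gamma0, gamma1 = 0.0, -0.5
beta0, beta1, beta2 = -1.0, 0.0, -0.8

x1 = rng.normal(size=n)
x2 = gamma0 + gamma1 * x1 + rng.normal(scale=0.5, size=n)
prob = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x1 + beta2 * x2)))
y = rng.binomial(1, prob)

ones = np.ones(n)
b_multi = fit_logistic(np.column_stack([ones, x1, x2]), y)  # multiple logistic fit
b_uni = fit_logistic(np.column_stack([ones, x1]), y)        # univariate fit on x1
gamma1_hat = np.polyfit(x1, x2, 1)[0]                       # slope of x2 regressed on x1

projected = b_multi[1] + gamma1_hat * b_multi[2]
print("multiple logistic coefficient of x1:", round(b_multi[1], 3))
print("projected slope of x1:              ", round(projected, 3))
print("univariate coefficient of x1:       ", round(b_uni[1], 3))
```

With these settings the projected slope and the univariate coefficient should both land near the true value of 0.4 while the multiple logistic coefficient of x1 stays near zero. As the text cautions, such agreement is not guaranteed in general.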
Example 1

In Table 1, the first example using Framingham data is given because it is
instructive and indicates a very desirable property of the projected slope for
potentially protecting against the overemphasis of a statistically significant
multiple logistic coefficient. Individuals in this example are followed for
26 years beginning around 1950 for the development of CHD. The predicting
variables of interest are height and weight. The significance of the
multivariate coefficient for height suggests that for a given weight, tall people
have a reduced chance of developing disease. If nothing were known about the
relationship between height and weight, one might conclude that height is an
independent contributor to CHD. If height and weight were unrelated this would
be true. Height and weight, however, have a correlation of 0.276 (p<0.05), so
that for a given weight taller people are leaner, and it is not height that
affects CHD but the whole concept of leanness, i.e., height and weight considered
together. In this instance, one should be interested in the total contribution
of height to CHD, i.e., a direct association as well as the association
of height to weight. Here, the multiple logistic coefficient for height is
$\beta_1 = -0.098$, for weight the coefficient is $\beta_2 = 0.012$, and the slope of
the regression between height and weight is $\gamma_1 = 2.892$. Thus, the projected
slope is -0.063 and it is not significant. It is clear from the form of the
projected slope that the benefit of being tall (indicated by $\beta_1 = -0.098$) is
reduced by the liability of increased weight that is associated with being tall
(indicated by $\gamma_1\beta_2 = 0.035$), and that height is not a meaningful
contributor to CHD by itself.
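
The arithmetic of Example 1 can be reproduced directly from the three reported estimates. This Python sketch only re-computes the published numbers; the variable names are chosen here for readability.

```python
# Estimates reported in Example 1 (height and weight as predictors of CHD).
beta_height = -0.098   # multiple logistic coefficient for height
beta_weight = 0.012    # multiple logistic coefficient for weight
gamma_weight = 2.892   # slope of weight regressed on height

direct = beta_height
indirect = gamma_weight * beta_weight   # contribution through weight
projected = direct + indirect

print(f"direct association (height):  {direct:.3f}")
print(f"indirect association (weight): {indirect:.3f}")
print(f"projected slope:               {projected:.3f}")
```

The indirect term rounds to 0.035 and the total to -0.063, matching the values quoted in the text.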


Example  2 

Example 2 is similar to example 1 in terms of conclusions but is based on a
more realistic application of multiple logistic analysis. Here, T-C, TG, HDL-C,
MRW, systolic blood pressure (SBP), smoking, and age are examined in our
Framingham sample as risk factors for CHD with follow-up of subjects beginning
around 1972 and lasting about 6 years. There is some belief that in older age
groups, such as that depicted by our sample, the relationship between T-C and
CHD is weaker than it is among younger individuals (18). In our example, the
univariate coefficient for T-C is consistent with this hypothesis since it is not
significant. In contrast, the multiple logistic regression coefficient for T-C
is significant. The latter implies that for given levels of the covariables,
high levels of T-C significantly increase the chance of developing CHD. This
interpretation, however, is misleading among our older sample because
differences in T-C are commonly accompanied by differences in the other covariables.

The projected slope helps describe a more comprehensive relationship
between T-C and CHD. From the Appendix, a general expression for the projected
slope of a variable $x_1$ when there are p covariables is
$\beta_1 + \gamma_{21}\beta_2 + \cdots + \gamma_{p1}\beta_p$.
Here, $\beta_k$ is the multiple logistic regression coefficient for the kth variable.
The coefficient $\gamma_{k1}$ is the slope coefficient for $x_k$ regressed on $x_1$. For
this example, we take $x_1$ = T-C, $x_2$ = TG, $x_3$ = HDL-C, $x_4$ = MRW, $x_5$ = SBP,
$x_6$ = smoking status, and $x_7$ = age. The respective estimates for $\gamma_{k1}$,
k = 2, 3, ..., 7, are 0.003, 0.030, 0.023, -0.005, 0.000, and -0.014. The
respective estimates for $\beta_k$, k = 1, 2, ..., 7, are 0.006, -0.261, -0.047,
0.006, 0.008, 0.216, and 0.029. Thus, the projected slope is 0.003 and more in
line with what is implied by the univariate coefficient and what is expected.
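
The projected slope for T-C can be re-computed from the rounded estimates listed above, which also shows how much each covariate contributes. This Python sketch uses only the published three-decimal values, so the total can differ slightly from a calculation based on unrounded estimates.

```python
# Rounded estimates reported in Example 2; x1 is T-C.
beta_tc = 0.006
covariates = ["TG", "HDL-C", "MRW", "SBP", "smoking", "age"]
gammas = [0.003, 0.030, 0.023, -0.005, 0.000, -0.014]   # each covariate regressed on T-C
betas = [-0.261, -0.047, 0.006, 0.008, 0.216, 0.029]    # multiple logistic coefficients

contributions = {name: g * b for name, g, b in zip(covariates, gammas, betas)}
projected = beta_tc + sum(contributions.values())

for name, value in contributions.items():
    print(f"{name:>8}: {value:+.6f}")
print(f"projected slope for T-C: {projected:.4f}")
```

From the rounded inputs the total is about 0.0035, consistent with the reported 0.003 and roughly half the multiple logistic coefficient of 0.006. The HDL-C term is the largest single reduction.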
Among the covariables, it turns out that HDL-C is the most consistent
predictor of CHD (p<0.05) and acts on CHD in a protective fashion. HDL-C is also
correlated with T-C. The correlation is 0.092 (p<0.05). It would seem that
since high levels of T-C are accompanied by elevated and protective levels of
HDL-C, the effect of T-C on CHD should be diminished. If we look at the
contribution to the projected slope by the relationship between T-C and HDL-C
and the association HDL-C has with CHD, we see that the misleading magnitude
of the multiple logistic regression coefficient associated with T-C
(represented by $\beta_1 = 0.006$) is partially reduced by an amount equal to
$\gamma_{31}\beta_3$. This reduction suggests that the liability of possessing higher
levels of T-C is mitigated by the likely presence and beneficial effects of also
possessing elevated levels of HDL-C. Here, it is the joint contribution between
HDL-C and T-C that is important and clearly taken into account by the projected
slope. Of course, relationships among the other covariables and CHD also
influence interpretation of the projected slope. These relationships, however,
exist to a much lesser degree and describing them would be superfluous.

Example  3 

We have to this point given examples indicating a useful property of the
projected slope in interpreting the predictive ability of a risk factor when
its multiple logistic coefficient is statistically significant. The projected
slope, however, also has the property of potentially protecting against
the unwarranted underemphasis of a statistically insignificant multiple logistic
coefficient, as will be shown in example 3 using the Framingham data with a
similar length of follow-up as example 2.

The third example was motivated by a paper (14) that questioned the
relationship of TG with CHD. The paper highlighted studies based on multiple
logistic regression models that indicated that TG is an insignificant
independent predictor of CHD. The paper concluded that the treatment of
hypertriglyceridemia to alter the chance of developing CHD may be ineffective. The
result was deemed important by the lay press and prompted close examination of
the issue at a workshop on hypertriglyceridemia where some of the cautions and
perspectives given in this paper were presented (19).

In the third example, the univariate coefficient for TG is significant, but
when HDL-C, T-C, and MRW are included as covariates, the significance is
reduced. In fact, the magnitude of the multivariate coefficient has become so
distorted as to be negative. This finding, although enigmatic at first, is
largely attributed to the strong correlation between TG and each of the
covariables (p<0.05). The correlation of TG with HDL-C was given earlier and is
-0.451. The correlations of TG with T-C and MRW are 0.276 and 0.227,
respectively. The direct interpretation of the multiple logistic regression
coefficient implies that for fixed levels of the covariables, changes in TG do not
affect CHD. But, on the average, differences in TG are often accompanied by
differences in all the covariables. At least one of these covariables, HDL-C,
exhibits a strong relationship with CHD (p<0.05).

To improve our understanding of the relationship of TG with CHD we again
compute the projected slope using the notation in the Appendix. We first need
the slope coefficients of HDL-C, T-C, and MRW regressed on TG. These values
are, respectively, $\gamma_{21} = -11.855$, $\gamma_{31} = 22.062$, and
$\gamma_{41} = 7.267$. We also need the corresponding multiple logistic regression
coefficients for TG, HDL-C, T-C, and MRW. These were given earlier in the
Introduction. The estimate of the projected slope is then
$\beta_1 + \gamma_{21}\beta_2 + \gamma_{31}\beta_3 + \gamma_{41}\beta_4 = 0.511$, more in
line with a positive association between TG and CHD that is commonly expected.
The implication is that the physiologic relationships between TG and the
covariables have changed the magnitude of the importance of TG in a multivariate
setting. Nevertheless, the total contribution of TG to CHD that includes a
direct association and an indirect relationship with CHD through the covariables,
and especially HDL-C, may still be important. This is clearly represented by the
projected slope. Here, the projected slope, which is significant (p<0.05),
suggests that, if observational data are useful for making clinical decisions,
altering TG may be an effective means of changing the risk of CHD.
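
The projected slope for TG can be verified from the coefficients given in the Introduction and the regression slopes quoted above. This Python sketch simply re-computes the reported figure and breaks it into its direct and indirect parts; the dictionary layout is an illustrative choice.

```python
# Multiple logistic coefficients reported in the Introduction (Framingham males).
beta_tg = -0.183
betas = {"HDL-C": -0.048, "T-C": 0.005, "MRW": 0.002}

# Slopes of each covariate regressed on TG, as reported in Example 3.
gammas = {"HDL-C": -11.855, "T-C": 22.062, "MRW": 7.267}

contributions = {name: gammas[name] * betas[name] for name in gammas}
projected = beta_tg + sum(contributions.values())

print(f"direct (TG):      {beta_tg:+.3f}")
for name, value in contributions.items():
    print(f"via {name:<6}:      {value:+.3f}")
print(f"projected slope:  {projected:+.3f}")
```

The path through HDL-C contributes about 0.569, which more than offsets the negative direct coefficient and accounts for most of the projected slope of 0.511.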

Conclusion 

In the investigation of an association between a characteristic and disease,
it is important to consider not just the significance of a multiple logistic
regression coefficient, but the total contribution that a characteristic makes to
the development of disease. These contributions include those that are direct and
those that are shared among relationships with other characteristics. If this
is not the interest, then it may be difficult to isolate and understand the
effect of a characteristic on CHD when it could be one of several interacting
components participating in a biological mechanism.

The projected slope is used as a means to help show that the magnitude of
the multiple logistic coefficient is often difficult to interpret. The
projected slope is meant to offer explanation and insight into the importance of
a significant univariate coefficient and why a multivariate coefficient has or
has not achieved significance by way of relationships through the covariates
included in a logistic expression.

In example 1, the projected slope provided a comprehensive perspective that
helped explain an important relationship between height and CHD. In the second
example, we discovered how the multiple logistic coefficient for T-C can be
reduced when, among older individuals, elevated T-C may increase the capacity to
carry cholesterol in the high density lipoprotein class, resulting in a
diminished association between T-C and CHD. In both of these examples, the
projected slope has not only improved our perspective of disease causality, but
it has also protected us against the overemphasis of a statistically
significant multiple logistic regression coefficient. Furthermore, in example 3, the
projected slope has also shown how it can protect against the unwarranted
underemphasis of a statistically insignificant multiple logistic regression
coefficient. Here, TG has the potential for being thought of as an innocuous lipid
marginally related to disease. TG, however, is related to HDL-C, the latter
of which strongly influences the chance of developing CHD. Unless this
relationship is taken into account, as it is by the projected slope, the effect of
TG on CHD will not be understood and the benefits of reducing elevated levels of
TG will not be appreciated.

This  presentation  has  shown  that  the  magnitude  of  the  multiple  logistic 
regression  coefficient  is  uncertain  when  other  variables  with  a  close  physio¬ 
logic  relationship  are  included  in  the  multiple  logistic  expression  and  that 
awareness  of  this  possibility  is  important.  Furthermore,  attempting  to  iso¬ 
late  independent  contributions  to  disease  by  examining  the  magnitude  of  the 
multiple  logistic  coefficient  may  be  misleading  because  of  the  confounding 
influences  shared  among  covariates.  It  may  also  be  the  case  that  these  latter 
influences  and  their  relationships  with  a  risk  factor  may  define  a  metabolic 
system  that  should  not  be  broken  down  into  components,  but  instead,  considered 
in  its  entirety. 

It is apparent that, even if it were realistic to isolate risk factors,
properly assessing their independent contribution to disease would require that
enough observations on the risk factor be made across all levels of the
other risk factors. This is often the goal of controlled clinical trials but
rarely occurs in nonrandomized or observational studies. It is clear that
if we have insufficient data on a variable for all levels of the other
variables, we will lack the evidence to investigate the first two conclusions
of the Introduction. Indeed, the relationship of TG to CHD may be partially
explainable by HDL-C, but we lack the data to say that TG has an unimportant
direct relationship with CHD. Furthermore, if we do mistakenly assume that
the first two conclusions are true, we certainly cannot assume that the last
conclusion is also true. This is most evident in our example on TG, where
changes in TG affect the chance of developing CHD.

The  examples  we  have  presented  show  clearly  that  the  projected  slope  is  a 
useful  device  when  used  as  a  supplement  for  multiple  logistic  regression  in 
prospective  observational  studies.  With  standard  computer  packages,  it  is  easy 
to  calculate  and  test.  We  believe  the  projected  slope,  similar  as  it  is  to  the 
well  known  area  of  path  analysis,  is  intuitively  easy  to  understand.  While  it 
is  certainly  not  the  only  way  to  deal  with  confounding  and  multicollinearity, 
the  projected  slope  is  a  useful  tool  for  understanding  important  causal  rela¬ 
tionships  between  risk  factors  and  disease. 


APPENDIX 


When independent variables are correlated in linear regression, estimating
parameters and reducing the error associated with the estimates is often
accomplished by using Stein estimates or ridge regression (20). Such methods to
date are not readily available for logistic regression, and our interest is in
estimating not one parameter but a function of parameters. In this paper, the
test statistic for the significance of the projected slope is approximated by
hypothesizing a joint model for the ith class of individuals that share the
same values of $x_1$ and $x_2$:

[3]  $x_{2i} = \gamma_0 + \gamma_1 x_{1i} + \varepsilon_i$ ,  and conditionally on $x_{2i}$

[4]  $\operatorname{logit}[p_i(\text{disease})] = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$ .

Model [3] represents the linear relationship between $x_1$ and $x_2$
illustrated in Figure 2. The unconditional expectation of the log odds ratio
for any two individuals in classes i and j falling on the regression line
between $x_1$ and $x_2$ is:

$\operatorname{logit}[p_i(\text{disease})] - \operatorname{logit}[p_j(\text{disease})] = (\beta_1 + \gamma_1\beta_2)(x_{1i} - x_{1j})$

In this expression, $x_{2i}$ is replaced by its expectation, $\gamma_0 + \gamma_1 x_{1i}$.
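
The step behind this expression can be written out as a short derivation. It
is only a sketch: it assumes the error term of model [3] has mean zero (the
usual least squares assumption, not stated explicitly in the text), and it
writes $\beta$ and $\gamma$ for the coefficients of models [4] and [3].

```latex
\begin{align*}
E\{\operatorname{logit}[p_i]\} - E\{\operatorname{logit}[p_j]\}
  &= \beta_1 (x_{1i} - x_{1j}) + \beta_2 \{ E(x_{2i}) - E(x_{2j}) \} \\
  &= \beta_1 (x_{1i} - x_{1j}) + \beta_2 \gamma_1 (x_{1i} - x_{1j}) \\
  &= (\beta_1 + \gamma_1 \beta_2)(x_{1i} - x_{1j}).
\end{align*}
```

The coefficient $\beta_1 + \gamma_1\beta_2$ is therefore the change in expected
log odds per unit change in $x_1$ when $x_2$ is allowed to move with $x_1$
along the regression line, which is the quantity the projected slope measures.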

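A test of significance of the projected slope is equivalent to testing
$H_0$: $\beta_1 + \gamma_1\beta_2 = 0$. While the obvious estimates of
$\beta_1$, $\beta_2$, and $\gamma_1$ can be used to estimate
$\beta_1 + \gamma_1\beta_2$, the following informal analysis is useful for
computing a test statistic for $H_0$. In order to test $H_0$, model [4] is
rewritten with $x_{2i}$ replaced with $\gamma_0 + \gamma_1 x_{1i} + \varepsilon_i$
to give the following model.

[5]  $\operatorname{logit}[p_i(\text{disease})] = \delta_0 + \delta_1 x_{1i} + \delta_2 \varepsilon_i$

where $\delta_0 = \beta_0 + \gamma_0\beta_2$, $\delta_1 = \beta_1 + \gamma_1\beta_2$, and
$\delta_2 = \beta_2$.

Thus, an equivalent hypothesis is $H_0$: $\delta_1 = 0$.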

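To test $H_0$, the coefficients of model [5] are estimated by regressing the
estimated class logits, $y_i = \operatorname{logit}[\hat{p}_i(\text{disease})]$,
on $x_{1i}$ and $\hat{\varepsilon}_i$, where
$\hat{\varepsilon}_i = x_{2i} - \hat{\gamma}_0 - \hat{\gamma}_1 x_{1i}$, and
$\hat{\gamma}_0$ and $\hat{\gamma}_1$ are the ordinary least squares estimates
of $\gamma_0$ and $\gamma_1$. Here, iterative weighted least squares (2) is used
to estimate $\delta_0$, $\delta_1$, and $\delta_2$.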
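
The two-step procedure just described can be sketched in Python. This is an
illustration under stated assumptions, not the computation used in the paper:
it fits the logistic model to individual-level data by Newton-Raphson (an
iteratively reweighted least squares scheme) rather than to estimated class
logits, the data are simulated, and the names `fit_logistic` and
`projected_slope` are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic fit by Newton-Raphson (an IRLS scheme).

    X must already include an intercept column.
    Returns the coefficient vector and its estimated covariance matrix.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)                       # IRLS weights
        info = X.T @ (X * w[:, None])           # information matrix X'WX
        step = np.linalg.solve(info, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, cov

def projected_slope(x1, x2, y):
    """Two-step estimate of the projected slope of x1 (hypothetical helper)."""
    ones = np.ones(len(y))
    # Step 1: ordinary least squares of x2 on x1, as in model [3].
    G = np.column_stack([ones, x1])
    gamma = np.linalg.lstsq(G, x2, rcond=None)[0]
    resid = x2 - G @ gamma
    # Step 2: logistic regression on x1 and the residual, as in model [5].
    delta, cov = fit_logistic(np.column_stack([ones, x1, resid]), y)
    se = np.sqrt(cov[1, 1])                     # ignores the error in gamma
    return delta[1], se, delta[1] / se, gamma

# Simulated data (assumed values, chosen only for illustration).
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.5 - 0.8 * x1 + rng.normal(scale=0.6, size=n)
eta = -1.0 + 0.1 * x1 - 0.9 * x2
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

slope, se, z, gamma = projected_slope(x1, x2, y)
beta, _ = fit_logistic(np.column_stack([np.ones(n), x1, x2]), y)

print("projected slope:", slope, "SE:", se, "z:", z)
print("beta1 + gamma1*beta2:", beta[1] + gamma[1] * beta[2])
```

Because the second step is only a reparameterization of the model in $x_1$ and
$x_2$, the coefficient of $x_1$ it returns agrees, up to convergence tolerance,
with the combination of the separately fitted coefficients printed on the last
line. The simulated values imply a true projected slope of
0.1 + (-0.8)(-0.9) = 0.82.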

Exact computations based on the usual linear model suggest that the
estimate, $\hat{\delta}_1$, of $\delta_1$ is approximately unbiased (i.e.,
consistent and asymptotically normal) for $\beta_1 + \gamma_1\beta_2$, but the
estimated variance of $\hat{\delta}_1$ underestimates the true variance by a
factor of $\beta_2^2\sigma_\varepsilon^2$. Here, $\beta_2$ is from model [4] and
$\sigma_\varepsilon^2$ is the variance of $\varepsilon_i$ from model [3].

The magnitude of $\beta_2^2\sigma_\varepsilon^2$ is negligible, however, in the
examples considered in this paper, primarily because the value of $\beta_2$ is
frequently much less than one, making the contribution of $\beta_2^2$ small.
Comparisons were also made with bootstrap methods of estimation (21), which
give improved estimates of the variance of $\hat{\delta}_1$; these indicate that
ignoring $\beta_2^2\sigma_\varepsilon^2$ does not appreciably alter the
statistical results provided by the simpler estimation procedure given above.
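
A comparison of this kind can be sketched as follows. The sketch is only an
assumption about how such a check might be set up: it resamples individuals
with replacement (a nonparametric bootstrap), repeats both estimation steps on
each resample, and compares the resulting standard error with the one reported
by the simpler procedure. The data are simulated, and `fit_logistic` and
`two_step` are hypothetical helper names.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic fit; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = p * (1.0 - p)
    return beta, np.linalg.inv(X.T @ (X * w[:, None]))

def two_step(x1, x2, y):
    """Projected slope of x1 and its simple (model-based) standard error."""
    ones = np.ones(len(y))
    G = np.column_stack([ones, x1])
    gamma = np.linalg.lstsq(G, x2, rcond=None)[0]   # OLS step
    resid = x2 - G @ gamma
    delta, cov = fit_logistic(np.column_stack([ones, x1, resid]), y)
    return delta[1], np.sqrt(cov[1, 1])

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 - 0.8 * x1 + rng.normal(scale=0.6, size=n)
eta = -1.0 + 0.1 * x1 - 0.9 * x2
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

slope, naive_se = two_step(x1, x2, y)

# Bootstrap: redo BOTH steps on each resample so that the sampling error
# of the least squares coefficients is carried into the projected slope.
B = 200
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b], _ = two_step(x1[idx], x2[idx], y[idx])
boot_se = boot.std(ddof=1)

print("simple SE:", naive_se, "bootstrap SE:", boot_se)
```

When the coefficient of the second variable is small, the two standard errors
should be close, which is the situation described in the text.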

The technique used here is also easily extended to the case when several
independent variables are modeled in a multiple logistic equation. In this
instance, if $x_1, x_2, \ldots, x_p$ are antecedent characteristics, the test of
the projected slope (which becomes projected in a hyperplane) can be written as
$H_0$: $\beta_1 + \gamma_{21}\beta_2 + \gamma_{31}\beta_3 + \cdots + \gamma_{p1}\beta_p = 0$.
Here, $\gamma_{k1}$ is the slope coefficient appearing in the following model
for the ith class of individuals.

$x_{ki} = \gamma_{k0} + \gamma_{k1} x_{1i} + \varepsilon_{ki}$

The test of $H_0$ is then extended by regressing the estimated class logits,
$y_i = \operatorname{logit}[\hat{p}_i(\text{disease})]$, on
$x_{1i}, \hat{\varepsilon}_{2i}, \hat{\varepsilon}_{3i}, \ldots, \hat{\varepsilon}_{pi}$,
where $\hat{\varepsilon}_{ki} = x_{ki} - \hat{\gamma}_{k0} - \hat{\gamma}_{k1} x_{1i}$,
and $\hat{\gamma}_{k0}$ and $\hat{\gamma}_{k1}$ are the least squares estimates
of $\gamma_{k0}$ and $\gamma_{k1}$. As when only two antecedent characteristics
are included in the logistic model, the test for $H_0$ is equivalent to testing
the logistic coefficient associated with $x_1$ in the reparameterized analog to
model [5].
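
The reparameterized analog is not written out in the text. A sketch of it,
obtained by substituting the model for each $x_{ki}$ into the multiple logistic
equation, is given below; the $\delta$ symbols for the p-variable case are
notation assumed here by analogy with model [5].

```latex
\begin{align*}
\operatorname{logit}[p_i(\text{disease})]
  &= \beta_0 + \beta_1 x_{1i} + \sum_{k=2}^{p} \beta_k x_{ki} \\
  &= \beta_0 + \beta_1 x_{1i}
     + \sum_{k=2}^{p} \beta_k (\gamma_{k0} + \gamma_{k1} x_{1i} + \varepsilon_{ki}) \\
  &= \delta_0 + \delta_1 x_{1i} + \sum_{k=2}^{p} \delta_k \varepsilon_{ki},
\end{align*}
where
\[
\delta_0 = \beta_0 + \sum_{k=2}^{p} \gamma_{k0} \beta_k , \qquad
\delta_1 = \beta_1 + \sum_{k=2}^{p} \gamma_{k1} \beta_k , \qquad
\delta_k = \beta_k \quad (k = 2, \ldots, p).
\]
```

Under this notation the hypothesis on the projected slope is again
$\delta_1 = 0$, so only the coefficient of $x_1$ in the refitted model needs to
be tested.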

FIGURE LEGEND

Figure 1: The Probability of Disease, P(disease), Predicted from Uncorrelated
Risk Factors, $x_1$ and $x_2$.

Figure 2: The Probability of Disease, P(disease), Predicted from Correlated
Risk Factors, $x_1$ and $x_2$.

Table 1. Logistic Regression Coefficients and Projected Slopes for Selected
Variables Used to Predict Coronary Heart Disease

                Coefficient                                Characteristics¹
          ------------------------  Projected  -----------------------------------------------------------
Example   Univariate  Multivariate  Slope      Variable  Covariables                        Group²

   1        -0.006      -0.098*      -0.063    Height    Weight                             Females, 35-44
   2         0.003       0.006§       0.003    T-C       TG, HDL-C, MRW, SBP, Smoking, Age  Males, 50-80
   3         0.437*     -0.183        0.511*   TG        HDL-C, T-C, MRW                    Males, 50-80

*p<0.05   §p<0.10

¹HDL-C = High density lipoprotein cholesterol
 MRW = Metropolitan relative weight
 SBP = Systolic blood pressure
 TG = Triglyceride
 T-C = Total cholesterol
 Smoking status = yes or no

²Intervals denote observed range of ages